NSF PAR Search | NSF Public Access Repository

Optimizing context-based location extraction by tuning open-source LLMs with RAG

https://doi.org/10.1080/17538947.2025.2521786

Wang, Zifu; Masri, Yahya; Malarvizhi, Anusha Srirenganathan; Stover, Tayven; Ahmed, Samir; Wong, David; Jiang, Yongyao; Li, Yun; Bere, Mathieu; Rothbart, Daniel; et al. (August 2025, International Journal of Digital Earth)

Full Text Available
Comparative Analysis of BERT and GPT for Classifying Crisis News with Sudan Conflict as an Example

https://doi.org/10.3390/a18070420

Masri, Yahya; Wang, Zifu; Srirenganathan Malarvizhi, Anusha; Ahmed, Samir; Stover, Tayven; Wong, David W. S.; Jiang, Yongyao; Li, Yun; Liu, Qian; Bere, Mathieu; et al. (July 2025, Algorithms)

Accurate classification of news and events is critical for obtaining actionable information for humanitarian and other emergency responses. Daily news and social media posts are hard to classify by the information they convey, especially when a single item embeds multiple categories of information. This research used large language models (LLMs) and traditional transformer-based models, such as BERT, to classify news and social media events, using the Sudan Conflict as an example. A systematic evaluation framework was introduced to test GPT models using zero-shot prompting, Retrieval-Augmented Generation (RAG), and RAG with In-Context Learning (ICL) against standard and hyperparameter-tuned bert-base and bert-large models. BERT outperformed GPT in F1-score and accuracy for multi-label classification (MLC), while GPT outperformed BERT in accuracy for Single-Label classification from Multi-Label Ground Truth (SL-MLG). The results illustrate that a larger model size improves classification accuracy for both BERT and GPT, while BERT benefits from hyperparameter tuning and GPT benefits from its enhanced contextual comprehension capabilities. By addressing challenges such as overlapping semantic categories, task-specific adaptation, and a limited dataset, this study provides a deeper understanding of LLMs’ applicability in constrained, real-world scenarios, and in particular highlights the potential for integrating NLP with other applications such as GIS in future conflict analyses.
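The two evaluation settings named in the abstract can be illustrated with a small sketch. It assumes micro-averaged F1 for the multi-label case and, for SL-MLG, counts a single predicted label as correct when it appears anywhere in the multi-label ground truth. Both choices, and the category names, are assumptions made for illustration; the abstract does not define them.

```python
from typing import List, Set


def micro_f1(truth: List[Set[str]], pred: List[Set[str]]) -> float:
    """Micro-averaged F1 over label sets (assumed averaging mode)."""
    tp = sum(len(t & p) for t, p in zip(truth, pred))  # labels predicted and true
    fp = sum(len(p - t) for t, p in zip(truth, pred))  # predicted but not true
    fn = sum(len(t - p) for t, p in zip(truth, pred))  # true but missed
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def single_label_accuracy(truth: List[Set[str]], pred: List[str]) -> float:
    """Assumed SL-MLG reading: one predicted label is a hit if it is in the truth set."""
    if not truth:
        return 0.0
    hits = sum(1 for t, p in zip(truth, pred) if p in t)
    return hits / len(truth)


# Hypothetical categories, not taken from the paper.
truth = [{"violence", "displacement"}, {"aid"}]
multi_pred = [{"violence"}, {"aid", "violence"}]
single_pred = ["displacement", "violence"]

f1 = micro_f1(truth, multi_pred)
acc = single_label_accuracy(truth, single_pred)
```

The set-based representation keeps the example free of third-party dependencies; a library metric would normally be used on real data.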
Full Text Available
Is ChatGPT a Good Geospatial Data Analyst? Exploring the Integration of Natural Language into Structured Query Language within a Spatial Database

https://doi.org/10.3390/ijgi13010026

Jiang, Yongyao; Yang, Chaowei (January 2024, ISPRS International Journal of Geo-Information)

With recent advancements, large language models (LLMs) such as ChatGPT and Bard have shown the potential to disrupt many industries, from customer service to healthcare. Traditionally, humans interact with geospatial data through software (e.g., ArcGIS 10.3) and programming languages (e.g., Python). As a pioneering study, we explore the possibility of using an LLM as an interface for interacting with geospatial datasets through natural language. To achieve this, we propose a framework to (1) train an LLM to understand the datasets, (2) generate geospatial SQL queries from a natural language question, (3) send the SQL query to the backend database, and (4) parse the database response back into human language. As a proof of concept, a case study was conducted on real-world data to evaluate the framework's performance on various queries. The results show that LLMs can generate accurate SQL code in most cases, including spatial joins, although there is still room for improvement. As all geospatial data can be stored in a spatial database, we hope that this framework can serve as a proxy to improve the efficiency of spatial data analyses and unlock the possibility of automated geospatial analytics.
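The four framework steps can be sketched as a small pipeline. This is only an illustration under stated assumptions: the LLM is replaced by a stub (`fake_llm`), step (1) is approximated by putting the table schema into the prompt rather than training a model, and an in-memory SQLite table stands in for a spatial database. The table, its rows, and all function names are hypothetical.

```python
import sqlite3
from typing import Callable


def describe_schema(conn: sqlite3.Connection) -> str:
    """Step 1 (approximated): collect CREATE TABLE statements as prompt context."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    return "\n".join(r[0] for r in rows)


def answer(question: str, conn: sqlite3.Connection,
           generate_sql: Callable[[str], str]) -> str:
    prompt = f"Schema:\n{describe_schema(conn)}\nQuestion: {question}\nSQL:"
    sql = generate_sql(prompt)            # Step 2: natural language -> SQL
    rows = conn.execute(sql).fetchall()   # Step 3: send the query to the database
    # Step 4: turn the result rows back into a readable sentence.
    listed = "; ".join(", ".join(str(v) for v in r) for r in rows)
    return f"{len(rows)} row(s): {listed}"


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call (assumption, not the paper's model)."""
    return "SELECT name FROM stations WHERE elevation_m > 1000 ORDER BY name"


# Toy data, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (name TEXT, elevation_m INTEGER)")
conn.executemany(
    "INSERT INTO stations VALUES (?, ?)",
    [("Alpha", 1500), ("Bravo", 200), ("Charlie", 2300)],
)
result = answer("Which stations are above 1000 m?", conn, fake_llm)
```

Passing the SQL generator in as a callable keeps the pipeline testable without network access; in practice, generated SQL should be validated before it is executed.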
Full Text Available
A Query Understanding Framework for Earth Data Discovery

https://doi.org/10.3390/app10031127

Li, Yun; Jiang, Yongyao; Goldstein, Justin C.; Mcgibbney, Lewis J.; Yang, Chaowei (February 2020, Applied Sciences)

One longstanding complication with Earth data discovery involves understanding a user’s search intent from the input query. Most geospatial data portals search data by keyword matching. Little attention has been paid to the spatial and temporal information in a query or to understanding the query with an ontology. No research in the geospatial domain has investigated user queries in a systematic way. Here, we propose a query understanding framework that fills this gap by better interpreting a user’s search intent for Earth data search engines and by adopting knowledge mined from metadata and user query logs. The proposed query understanding tool contains four components: spatial and temporal parsing; concept recognition; Named Entity Recognition (NER); and semantic query expansion. Spatial and temporal parsing detects the spatial bounding box and temporal range from a query. Concept recognition isolates clauses from free text and provides the search engine with phrases instead of a list of words. Named entity recognition detects entities in the query, which tells the search engine to query the detected entities. The semantic query expansion module expands the original query by adding synonyms and acronyms, discovered from Web usage data and metadata, to phrases in the query. The four modules interact to parse a user’s query from multiple perspectives, with the goal of understanding the user’s search intent for data. As a proof of concept, the framework is applied to oceanographic data discovery. It is demonstrated that the proposed framework accurately captures a user’s intent.
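Two of the four components, spatial and temporal parsing and semantic query expansion, can be sketched in a few lines. The sketch assumes a tiny hand-written gazetteer and synonym table, whereas the abstract describes knowledge mined from metadata and query logs. The bounding box values are rough illustrative numbers, and every name below is hypothetical.

```python
import re
from typing import Dict, List, Optional, Tuple

BBox = Tuple[float, float, float, float]  # (west, south, east, north)

# Hand-written stand-ins for mined knowledge; values are illustrative only.
GAZETTEER: Dict[str, BBox] = {"gulf of mexico": (-98.0, 18.0, -80.0, 31.0)}
SYNONYMS: Dict[str, List[str]] = {"sea surface temperature": ["sst"]}


def parse_temporal(query: str) -> Optional[Tuple[int, int]]:
    """Return the (earliest, latest) four-digit year mentioned, if any."""
    years = [int(y) for y in re.findall(r"\b(?:19|20)\d{2}\b", query)]
    return (min(years), max(years)) if years else None


def parse_spatial(query: str) -> Optional[BBox]:
    """Return the bounding box of the first gazetteer place found in the query."""
    q = query.lower()
    for name, bbox in GAZETTEER.items():
        if name in q:
            return bbox
    return None


def expand(query: str) -> List[str]:
    """Return recognized phrases followed by their synonyms and acronyms."""
    q = query.lower()
    terms: List[str] = []
    for phrase, alternatives in SYNONYMS.items():
        if phrase in q:
            terms.append(phrase)
            terms.extend(alternatives)
    return terms


query = "sea surface temperature Gulf of Mexico 2010 to 2015"
parsed = {
    "time": parse_temporal(query),
    "bbox": parse_spatial(query),
    "terms": expand(query),
}
```

A search engine could then filter by `parsed["bbox"]` and `parsed["time"]` and match on `parsed["terms"]` instead of on the raw word list.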
Full Text Available
Improving search ranking of geospatial data based on deep learning using user behavior data

https://doi.org/10.1016/j.cageo.2020.104520

Li, Yun; Jiang, Yongyao; Yang, Chaowei; Yu, Manzhu; Kamal, Lara; Armstrong, Edward M.; Huang, Thomas; Moroni, David; McGibbney, Lewis J. (September 2020, Computers & Geosciences)
Full Text Available
A graph-based approach to detecting tourist movement patterns using social media data

https://doi.org/10.1080/15230406.2018.1496036

Hu, Fei; Li, Zhenlong; Yang, Chaowei; Jiang, Yongyao (March 2019, Cartography and Geographic Information Science)

Full Text Available
A hierarchical indexing strategy for optimizing Apache Spark with HDFS to efficiently query big geospatial raster data

https://doi.org/10.1080/17538947.2018.1523957

Hu, Fei; Yang, Chaowei; Jiang, Yongyao; Li, Yun; Song, Weiwei; Duffy, Daniel Q.; Schnase, John L.; Lee, Tsengdar (March 2020, International Journal of Digital Earth)
Full Text Available
Big Earth data analytics: a survey

https://doi.org/10.1080/20964471.2019.1611175

Yang, Chaowei; Yu, Manzhu; Li, Yun; Hu, Fei; Jiang, Yongyao; Liu, Qian; Sha, Dexuan; Xu, Mengchao; Gu, Juan (April 2019, Big Earth Data)

Full Text Available
A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch

https://doi.org/10.3390/app9061114

Li, Yun; Jiang, Yongyao; Gu, Juan; Lu, Mingyue; Yu, Manzhu; Armstrong, Edward; Huang, Thomas; Moroni, David; McGibbney, Lewis; Frank, Greguska; et al. (March 2019, Applied Sciences)

The volume, variety, and velocity of data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although log mining has been studied in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
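Session reconstruction, the first stage of the log mining described above, can be sketched as follows. The sketch groups hits by user and starts a new session after a period of inactivity. The 30-minute timeout, the event layout, and the use of plain Python instead of Apache Spark are all assumptions made for illustration; this is not the logPartitioner described in the abstract.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Event = Tuple[str, float, str]  # (user id, timestamp in seconds, requested URL)


def sessionize(events: List[Event], gap: float = 1800.0) -> Dict[str, List[List[str]]]:
    """Split each user's hits into sessions separated by more than `gap` seconds."""
    by_user: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for user, ts, url in events:
        by_user[user].append((ts, url))

    sessions: Dict[str, List[List[str]]] = {}
    for user, hits in by_user.items():
        hits.sort()  # order each user's hits by timestamp
        user_sessions = [[hits[0][1]]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > gap:
                user_sessions.append([])  # inactivity gap: start a new session
            user_sessions[-1].append(url)
        sessions[user] = user_sessions
    return sessions


# Hypothetical log entries.
events = [
    ("u1", 0.0, "/search?q=sst"),
    ("u1", 60.0, "/dataset/1"),
    ("u1", 4000.0, "/search?q=wind"),
    ("u2", 10.0, "/dataset/2"),
]
sessions = sessionize(events)
```

Because users differ widely in how many hits they generate, grouping by user in a parallel engine produces partitions of uneven size, which is the kind of data imbalance the abstract says its partitioning paradigm targets.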
Full Text Available
An Integrated Data Analytics Platform

https://doi.org/10.3389/fmars.2019.00354

Armstrong, Edward M.; Bourassa, Mark A.; Cram, Thomas A.; DeBellis, Maya; Elya, Jocelyn; Greguska, Frank R.; Huang, Thomas; Jacob, Joseph C.; Ji, Zaihua; Jiang, Yongyao; et al. (July 2019, Frontiers in Marine Science)

Full Text Available